Data Inspection and Preprocessing for Liver Cancer Detection¶
This script performs data loading, exploratory data analysis (EDA), and feature selection for a dataset related to liver cancer prediction.
The dataset used comes from the PLCO study, and different sections are analyzed to determine their relevance to the final model.
Author: Juan Armario
Date: 2024
Importing libraries¶
import pandas as pd
import numpy as np
## Others
import warnings
import sys
from collections import Counter
## Plot
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')
# Custom functions
sys.path.append("../../0. Scripts")
import data_analysing_functions as daf
import model_metrics_functions as mmf
Loading Data¶
liver_cancer_df = pd.read_csv('../../0. Data/0. Original/liver_data_mar22_d032222.csv')
pd.set_option('display.max_columns', None)
liver_cancer_df.shape
(154887, 167)
liver_cancer_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 154887 entries, 0 to 154886 Columns: 167 entries, liver_topography to in_TGWAS_population dtypes: float64(130), int64(34), object(3) memory usage: 197.3+ MB
liver_cancer_df
| liver_topography | liver_morphology | liver_grade | liver_behavior | liver_cancer_first | liver_cancer | liver_seer | liver_seercat | liver_annyr | liver_exitstat | liver_exitage | liver_exitdays | liver_cancer_diagdays | plco_id | build | build_cancers | build_incidence_cutoff | educat | marital | occupat | pipe | cigar | sisters | brothers | asp | ibup | fmenstr | menstrs | miscar | tubal | tuballig | bbd | benign_ovcyst | endometriosis | uterine_fib | bq_adminm | lmenstr | trypreg | prega | pregc | stillb | livec | fchilda | hystera | asppd | ibuppd | bcontra | bcontrt | curhorm | thorm | urinatea | enlprosa | infprosa | vasecta | hyperten_f | hearta_f | stroke_f | emphys_f | bronchit_f | diabetes_f | polyps_f | arthrit_f | osteopor_f | divertic_f | gallblad_f | bq_returned | bq_age | race7 | hispanic_f | surg_biopsy | surg_resection | surg_prostatectomy | surg_age | surg_any | preg_f | hyster_f | ovariesr_f | enlpros_f | infpros_f | prosprob_f | urinate_f | vasect_f | bcontr_f | horm_f | horm_stat | smoked_f | smokea_f | rsmoker_f | ssmokea_f | cigpd_f | filtered_f | cig_stat | cig_stop | cig_years | pack_years | bmi_20 | bmi_50 | bmi_curr | bmi_curc | weight_f | weight20_f | weight50_f | height_f | menstrs_stat_type | post_menopausal | bmi_20c | bmi_50c | colon_comorbidity | liver_comorbidity | fh_cancer | liver_fh | liver_fh_cnt | liver_fh_age | bq_compdays | d_dth_liver | f_dth_liver | d_codeath_cat | f_codeath_cat | d_cancersite | f_cancersite | d_seer_death | f_seer_death | is_dead_with_cod | is_dead | mortality_exitage | mortality_exitstat | build_death_cutoff | dth_days | mortality_exitdays | entryage_bq | entryage_dqx | entryage_dhq | entryage_sqx | entryage_muq | ph_any_bq | ph_any_dqx | ph_any_dhq | ph_any_sqx | ph_any_muq | ph_liver_bq | ph_liver_dqx | ph_liver_dhq | ph_liver_sqx | ph_liver_muq | ph_any_trial | ph_liver_trial | liver_eligible_bq | liver_eligible_sqx | liver_eligible_dhq | liver_eligible_dqx | entrydays_bq | entrydays_dqx | entrydays_dhq | entrydays_sqx | entrydays_muq | center | rndyear | arm | sex | age | agelevel | reconsent_outcome | reconsent_outcome_days | fstcan_exitstat | fstcan_exitage | fstcan_exitdays | in_TGWAS_population | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | 8 | 80 | 4794 | NaN | A-000899-7 | mar22/03.22.22 | 1 | 1 | 2.0 | 3.0 | 4.0 | 0.0 | 0.0 | 3.0 | 3.0 | 1.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 67.0 | 2 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | 0.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | 26.383937 | 29.022331 | 25.724339 | 3.0 | 195.0 | 200.0 | 220.0 | 73.0 | NaN | NaN | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 9.0 | 0.0 | NaN | -25.0 | 0 | 0 | 900.0 | 900.0 | 999.0 | 999.0 | 50051.0 | 50051.0 | 1 | 1 | 88 | 1 | 4 | 7939.0 | 7939 | 67.0 | NaN | 70.0 | 76.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | NaN | 0 | 0 | 1 | 1 | 1 | 0 | 0.0 | NaN | 1067.0 | 3457.0 | NaN | 4 | 1996 | 1 | 1 | 67 | 2 | 2 | 5336 | 8 | 80 | 4794 | 1 |
| 1 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | 8 | 72 | 3873 | NaN | A-000989-6 | mar22/03.22.22 | 1 | 1 | 7.0 | 1.0 | 2.0 | 2.0 | 2.0 | 0.0 | 2.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN | 4.0 | NaN | NaN | NaN | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 | 62.0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | NaN | NaN | NaN | 1.0 | 21.0 | 0.0 | 29.0 | 5.0 | 1.0 | 2.0 | 33.0 | 8.0 | 24.0 | 22.313033 | 27.891291 | 25.659988 | 3.0 | 184.0 | 160.0 | 200.0 | 71.0 | NaN | NaN | 2.0 | 3.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | NaN | -7.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | 81 | 2 | 4 | NaN | 7160 | 62.0 | 62.0 | 65.0 | NaN | 75.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0 | 0 | 1 | 0 | 1 | 1 | 0.0 | 12.0 | 1063.0 | NaN | 5018.0 | 6 | 1999 | 1 | 1 | 62 | 1 | 1 | 4759 | 8 | 72 | 3873 | 1 |
| 2 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | 8 | 74 | 4123 | NaN | A-000998-7 | mar22/03.22.22 | 1 | 1 | 5.0 | 1.0 | 4.0 | 0.0 | 0.0 | 1.0 | 2.0 | 1.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 0.0 | NaN | NaN | NaN | NaN | 4.0 | NaN | NaN | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1 | 63.0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 | NaN | NaN | NaN | 1.0 | 15.0 | 0.0 | 50.0 | 2.0 | 2.0 | 2.0 | 13.0 | 35.0 | 35.0 | 30.680421 | 32.074985 | 34.585201 | 4.0 | 248.0 | 220.0 | 230.0 | 71.0 | NaN | NaN | 4.0 | 4.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | NaN | 15.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | 83 | 2 | 4 | NaN | 7410 | 63.0 | NaN | 64.0 | 70.0 | 77.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1 | 1 | 0 | 15.0 | NaN | 702.0 | 2774.0 | 5317.0 | 10 | 1998 | 2 | 1 | 62 | 1 | 1 | 4658 | 8 | 74 | 4123 | 0 |
| 3 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | 8 | 87 | 4672 | NaN | A-001799-8 | mar22/03.22.22 | 1 | 1 | 5.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 3.0 | 0.0 | 3.0 | 4.0 | 0.0 | 4.0 | 3.0 | 3.0 | 0.0 | 0.0 | NaN | 0.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 74.0 | 1 | 0.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | 0.0 | 1.0 | 1.0 | 0.0 | NaN | NaN | NaN | 0.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | 20.595703 | 22.312012 | 22.312012 | 2.0 | 130.0 | 120.0 | 130.0 | 64.0 | 3.0 | 1.0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | -31.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | 90 | 3 | 4 | NaN | 5621 | 74.0 | NaN | 77.0 | 83.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | NaN | 0 | 0 | 1 | 1 | 0 | 0 | 0.0 | NaN | 1064.0 | 3359.0 | NaN | 6 | 1997 | 2 | 2 | 74 | 3 | 5 | 5621 | 8 | 87 | 4672 | 1 |
| 4 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | 8 | 72 | 3386 | NaN | A-001889-7 | mar22/03.22.22 | 1 | 1 | 6.0 | 3.0 | 2.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | NaN | 0.0 | 0.0 | 1.0 | 2.0 | 1.0 | 2.0 | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1 | 63.0 | 1 | 0.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 18.0 | 0.0 | 38.0 | 1.0 | 1.0 | 2.0 | 25.0 | 20.0 | 10.0 | 18.879395 | 24.886475 | 27.460938 | 3.0 | 160.0 | 110.0 | 145.0 | 64.0 | 1.0 | 1.0 | 2.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | NaN | 20.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | 81 | 2 | 4 | NaN | 6673 | 63.0 | NaN | 63.0 | 69.0 | 75.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1 | 1 | 0 | 20.0 | NaN | 237.0 | 2322.0 | 4601.0 | 5 | 2000 | 2 | 2 | 63 | 1 | 1 | 4106 | 8 | 72 | 3386 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 154882 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | 8 | 69 | 4207 | NaN | Z-162295-2 | mar22/03.22.22 | 1 | 1 | 6.0 | 5.0 | 7.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 58.0 | 4 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | 0.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | 25.107143 | 26.541837 | 29.411224 | 3.0 | 205.0 | 175.0 | 185.0 | 70.0 | NaN | NaN | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | NaN | -13.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | 78 | 2 | 4 | NaN | 7494 | 58.0 | NaN | 60.0 | 66.0 | 72.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1 | 1 | 0 | 0.0 | NaN | 706.0 | 2855.0 | 5351.0 | 3 | 1998 | 2 | 1 | 58 | 0 | 1 | 4767 | 8 | 69 | 4207 | 1 |
| 154883 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | 8 | 77 | 5539 | NaN | Z-162349-7 | mar22/03.22.22 | 1 | 1 | 6.0 | 1.0 | 4.0 | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN | 4.0 | 4.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1 | 62.0 | 1 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 1.0 | NaN | NaN | NaN | 1.0 | 0.0 | 1.0 | 2.0 | 0.0 | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | 0.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | 20.524438 | 26.605753 | 28.886246 | 3.0 | 190.0 | 135.0 | 175.0 | 68.0 | NaN | NaN | 2.0 | 3.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | NaN | 0.0 | 0 | 0 | 1000.0 | 1000.0 | 999.0 | 999.0 | 60000.0 | 60000.0 | 1 | 1 | 81 | 1 | 4 | 7099.0 | 7099 | 62.0 | NaN | 68.0 | 74.0 | NaN | 0.0 | NaN | 0.0 | 1.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | NaN | 0 | 0 | 1 | 1 | 1 | 0 | 0.0 | NaN | 2113.0 | 4194.0 | NaN | 9 | 1994 | 2 | 1 | 62 | 1 | 3 | 6054 | 1 | 73 | 3994 | 1 |
| 154884 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | 5 | 78 | 2152 | NaN | Z-162358-8 | mar22/03.22.22 | 1 | 1 | 5.0 | 1.0 | 4.0 | 0.0 | 0.0 | 2.0 | 6.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN | 4.0 | 5.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1 | 72.0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 1.0 | 0.0 | 1.0 | 3.0 | 0.0 | NaN | NaN | NaN | 1.0 | 16.0 | 0.0 | 58.0 | 3.0 | 2.0 | 2.0 | 14.0 | 42.0 | 63.0 | 21.520408 | 30.128571 | 33.715306 | 4.0 | 235.0 | 150.0 | 210.0 | 70.0 | NaN | NaN | 2.0 | 4.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | NaN | -8.0 | 0 | 0 | 200.0 | 200.0 | 999.0 | 999.0 | 50060.0 | 50060.0 | 1 | 1 | 78 | 1 | 4 | 2152.0 | 2152 | 72.0 | NaN | 74.0 | NaN | NaN | 0.0 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | 0.0 | NaN | NaN | 0 | 0 | 1 | 0 | 1 | 0 | 0.0 | NaN | 711.0 | NaN | NaN | 4 | 1998 | 2 | 1 | 72 | 3 | 12 | 2152 | 5 | 78 | 2152 | 0 |
| 154885 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | 5 | 75 | 4524 | NaN | Z-162367-9 | mar22/03.22.22 | 1 | 1 | 3.0 | 1.0 | 4.0 | 0.0 | 0.0 | 1.0 | 4.0 | 1.0 | 1.0 | 5.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | NaN | 0.0 | 1.0 | 1.0 | 2.0 | 0.0 | 3.0 | 3.0 | 0.0 | 4.0 | 3.0 | 2.0 | 1.0 | 2.0 | 2.0 | 3.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1 | 62.0 | 1 | 0.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 20.0 | 1.0 | NaN | 2.0 | 1.0 | 1.0 | 0.0 | 42.0 | 42.0 | 20.985075 | 25.839831 | 27.405881 | 3.0 | 175.0 | 134.0 | 165.0 | 67.0 | 3.0 | 1.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | -9.0 | 0 | 0 | 100.0 | 100.0 | 14.0 | 14.0 | 14.0 | 14.0 | 1 | 1 | 75 | 1 | 4 | 4524.0 | 4524 | 62.0 | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | 0 | 0 | 1 | 0 | 0 | 0 | 0.0 | NaN | NaN | NaN | NaN | 4 | 1997 | 2 | 2 | 62 | 1 | 12 | 4524 | 1 | 63 | 250 | 0 |
| 154886 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | 5 | 62 | 1579 | NaN | Z-162376-0 | mar22/03.22.22 | 1 | 1 | 3.0 | 1.0 | 6.0 | 0.0 | 0.0 | 5.0 | 2.0 | 0.0 | 1.0 | 3.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 3.0 | 0.0 | 3.0 | 5.0 | 0.0 | 5.0 | 3.0 | NaN | 0.0 | 7.0 | NaN | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 57.0 | 1 | 0.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | 0.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | 22.655273 | 25.401367 | 25.229736 | 3.0 | 147.0 | 132.0 | 148.0 | 64.0 | 1.0 | 1.0 | 2.0 | 3.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | NaN | -40.0 | 0 | 0 | 900.0 | 900.0 | 999.0 | 999.0 | 50300.0 | 50300.0 | 1 | 1 | 62 | 1 | 4 | 1579.0 | 1579 | 57.0 | NaN | 61.0 | NaN | NaN | 0.0 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | 0.0 | NaN | NaN | 0 | 0 | 1 | 0 | 1 | 0 | 0.0 | NaN | 1409.0 | NaN | NaN | 4 | 1996 | 2 | 2 | 57 | 0 | 12 | 1579 | 5 | 62 | 1579 | 0 |
154887 rows × 167 columns
Analysing Data¶
In this section, we perform an exploratory data analysis (EDA) to understand the structure, quality, and distribution of variables within the dataset. The dataset contains multiple sections, each representing different types of medical and demographic information.
The main objectives of this analysis include:
- Identifying and categorizing variables to determine their relevance.
- Detecting potential data quality issues such as missing values or inconsistencies.
- Understanding variable distributions and relationships through descriptive statistics and visualizations.
This step is crucial for selecting meaningful features and ensuring that the dataset is well-prepared for model development.
## Creation of diferent groups, according the diferent sections of the data provided and explained in the PLCO guide, to analyse easily
## our variables, to include or not into our model.
section1 = ['plco_id', 'build', 'build_cancers', 'build_death_cutoff', 'build_incidence_cutoff']
section2 = ['ph_liver_trial', 'ph_any_trial', 'in_TGWAS_population']
section3 = ['liver_eligible_bq', 'entryage_bq', 'entrydays_bq', 'ph_liver_bq', 'ph_any_bq']
section4 = ['liver_eligible_dhq', 'entryage_dhq', 'entrydays_dhq', 'ph_liver_dhq', 'ph_any_dhq']
section5 = ['liver_eligible_dqx', 'entryage_dqx', 'entrydays_dqx', 'ph_liver_dqx', 'ph_any_dqx']
section6 = ['liver_eligible_sqx', 'entryage_sqx', 'entrydays_sqx', 'ph_liver_sqx', 'ph_any_sqx']
section7 = ['entryage_muq', 'entrydays_muq', 'ph_liver_muq', 'ph_any_muq']
section8 = ['fstcan_exitstat', 'liver_exitstat', 'fstcan_exitdays', 'liver_exitdays', 'fstcan_exitage', 'liver_exitage', 'mortality_exitstat', 'mortality_exitdays', 'mortality_exitage']
section9 = ['age', 'agelevel', 'arm', 'center', 'rndyear', 'sex']
section10 = ['reconsent_outcome', 'reconsent_outcome_days']
section11 = ['liver_cancer', 'liver_cancer_diagdays', 'liver_cancer_first', 'liver_annyr']
section12 = ['liver_behavior', 'liver_grade', 'liver_morphology', 'liver_topography', 'liver_seer', 'liver_seercat']
section13 = ['is_dead', 'is_dead_with_cod', 'dth_days']
section14 = ['d_seer_death', 'd_cancersite', 'd_dth_liver', 'd_codeath_cat']
section15 = ['f_seer_death', 'f_cancersite', 'f_dth_liver', 'f_codeath_cat']
section16 = ['bq_returned', 'bq_age', 'bq_compdays', 'bq_adminm']
section17 = ['race7', 'hispanic_f', 'educat', 'marital', 'occupat']
section18 = ['cig_stat', 'cig_stop', 'cig_years', 'cigpd_f', 'pack_years', 'cigar', 'filtered_f', 'pipe', 'rsmoker_f', 'smokea_f', 'smoked_f', 'ssmokea_f']
section19 = ['fh_cancer', 'liver_fh', 'liver_fh_age', 'liver_fh_cnt', 'brothers', 'sisters']
section20 = ['bmi_curc', 'bmi_curr', 'height_f', 'weight_f', 'bmi_20', 'bmi_20c', 'weight20_f', 'bmi_50', 'bmi_50c', 'weight50_f']
section21 = ['asp', 'asppd', 'ibup', 'ibuppd']
section22 = ['arthrit_f', 'bronchit_f', 'colon_comorbidity', 'diabetes_f', 'divertic_f', 'emphys_f', 'gallblad_f', 'hearta_f', 'hyperten_f', 'liver_comorbidity', 'osteopor_f', 'polyps_f', 'stroke_f']
section23 = ['hyster_f', 'hystera', 'ovariesr_f', 'tuballig', 'bcontr_f', 'bcontra', 'bcontrt', 'curhorm', 'horm_f', 'horm_stat', 'thorm', 'fchilda', 'livec', 'miscar', 'preg_f', 'prega', 'pregc', 'stillb', 'trypreg', 'tubal', 'fmenstr', 'lmenstr', 'menstrs', 'menstrs_stat_type', 'post_menopausal', 'bbd', 'benign_ovcyst', 'endometriosis', 'uterine_fib']
section24 = ['enlpros_f', 'enlprosa', 'infpros_f', 'infprosa', 'prosprob_f', 'urinate_f', 'urinatea']
section25 = ['surg_age', 'surg_any', 'surg_biopsy', 'surg_prostatectomy', 'surg_resection', 'vasect_f', 'vasecta']
Section 1: Identifiers¶
section1_df = liver_cancer_df[section1]
section1_df
| plco_id | build | build_cancers | build_death_cutoff | build_incidence_cutoff | |
|---|---|---|---|---|---|
| 0 | A-000899-7 | mar22/03.22.22 | 1 | 4 | 1 |
| 1 | A-000989-6 | mar22/03.22.22 | 1 | 4 | 1 |
| 2 | A-000998-7 | mar22/03.22.22 | 1 | 4 | 1 |
| 3 | A-001799-8 | mar22/03.22.22 | 1 | 4 | 1 |
| 4 | A-001889-7 | mar22/03.22.22 | 1 | 4 | 1 |
| ... | ... | ... | ... | ... | ... |
| 154882 | Z-162295-2 | mar22/03.22.22 | 1 | 4 | 1 |
| 154883 | Z-162349-7 | mar22/03.22.22 | 1 | 4 | 1 |
| 154884 | Z-162358-8 | mar22/03.22.22 | 1 | 4 | 1 |
| 154885 | Z-162367-9 | mar22/03.22.22 | 1 | 4 | 1 |
| 154886 | Z-162376-0 | mar22/03.22.22 | 1 | 4 | 1 |
154887 rows × 5 columns
daf.nulls_percentage(section1_df)
plco_id , 0.0% nulls , 154887 unique values, object build , 0.0% nulls , 1 unique values, object build_cancers , 0.0% nulls , 1 unique values, int64 build_death_cutoff , 0.0% nulls , 1 unique values, int64 build_incidence_cutoff , 0.0% nulls , 1 unique values, int64
We can see variables with unique values, which are going to be removed from our model
## Division of the variables in 2 different groups, 'category' and 'numerics'
section1_cat_cols = section1_df.select_dtypes(include = ['object', 'category']).columns
section1_num_cols = section1_df.select_dtypes(exclude = ['object', 'category']).columns
section1_df[section1_cat_cols].describe().T
| count | unique | top | freq | |
|---|---|---|---|---|
| plco_id | 154887 | 154887 | A-000899-7 | 1 |
| build | 154887 | 1 | mar22/03.22.22 | 154887 |
section1_df[section1_num_cols].describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| build_cancers | 154887.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| build_death_cutoff | 154887.0 | 4.0 | 0.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 |
| build_incidence_cutoff | 154887.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
section1_df[section1_num_cols].hist(figsize=(20,20), bins = 30, xrot=-45 ,)
array([[<Axes: title={'center': 'build_cancers'}>,
<Axes: title={'center': 'build_death_cutoff'}>],
[<Axes: title={'center': 'build_incidence_cutoff'}>, <Axes: >]],
dtype=object)
After analysing the variables in the section 1, I have decided only keep the one called "plco_id", and use it as id, and delete the rest, ('build', 'build_cancers', 'build_death_cutoff', 'build_incidence_cutoff'), because the only take one single value which doesn't affect to the construction of my model.
Section 2: Study¶
section2_df = liver_cancer_df[section2]
section2_df
| ph_liver_trial | ph_any_trial | in_TGWAS_population | |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 |
| ... | ... | ... | ... |
| 154882 | 0 | 0 | 1 |
| 154883 | 0 | 0 | 1 |
| 154884 | 0 | 0 | 0 |
| 154885 | 0 | 0 | 0 |
| 154886 | 0 | 0 | 0 |
154887 rows × 3 columns
daf.nulls_percentage(section2_df)
ph_liver_trial , 0.0% nulls , 3 unique values, int64 ph_any_trial , 0.0% nulls , 3 unique values, int64 in_TGWAS_population , 0.0% nulls , 2 unique values, int64
section2_df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ph_liver_trial | 154887.0 | 0.290244 | 1.589667 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 |
| ph_any_trial | 154887.0 | 0.330880 | 1.585438 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 |
| in_TGWAS_population | 154887.0 | 0.713824 | 0.451974 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
section2_df.hist(figsize=(20,20), bins = 30, xrot=-45 ,)
array([[<Axes: title={'center': 'ph_liver_trial'}>,
<Axes: title={'center': 'ph_any_trial'}>],
[<Axes: title={'center': 'in_TGWAS_population'}>, <Axes: >]],
dtype=object)
section2_df.ph_liver_trial.value_counts()
ph_liver_trial 0 149876 9 4993 1 18 Name: count, dtype: int64
section2_df.ph_any_trial.value_counts()
ph_any_trial 0 143086 1 6870 9 4931 Name: count, dtype: int64
section2_df.in_TGWAS_population.value_counts()
in_TGWAS_population 1 110562 0 44325 Name: count, dtype: int64
Section 3: BQ Eligibility¶
section3_df = liver_cancer_df[section3]
section3_df
| liver_eligible_bq | entryage_bq | entrydays_bq | ph_liver_bq | ph_any_bq | |
|---|---|---|---|---|---|
| 0 | 1 | 67.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1 | 62.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1 | 63.0 | 15.0 | 0.0 | 0.0 |
| 3 | 1 | 74.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1 | 63.0 | 20.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 154882 | 1 | 58.0 | 0.0 | 0.0 | 0.0 |
| 154883 | 1 | 62.0 | 0.0 | 0.0 | 0.0 |
| 154884 | 1 | 72.0 | 0.0 | 0.0 | 0.0 |
| 154885 | 1 | 62.0 | 0.0 | 0.0 | 0.0 |
| 154886 | 1 | 57.0 | 0.0 | 0.0 | 0.0 |
154887 rows × 5 columns
daf.nulls_percentage(section3_df)
liver_eligible_bq , 0.0% nulls , 2 unique values, int64 entryage_bq , 3.2% nulls , 30 unique values, float64 entrydays_bq , 3.2% nulls , 550 unique values, float64 ph_liver_bq , 3.2% nulls , 3 unique values, float64 ph_any_bq , 3.2% nulls , 3 unique values, float64
section3_df.hist(figsize=(20,20), bins = 30)
array([[<Axes: title={'center': 'liver_eligible_bq'}>,
<Axes: title={'center': 'entryage_bq'}>],
[<Axes: title={'center': 'entrydays_bq'}>,
<Axes: title={'center': 'ph_liver_bq'}>],
[<Axes: title={'center': 'ph_any_bq'}>, <Axes: >]], dtype=object)
section3_df.liver_eligible_bq.value_counts()
liver_eligible_bq 1 149369 0 5518 Name: count, dtype: int64
sectionT = ['liver_eligible_bq', 'ph_liver_trial']
sectionT_df = liver_cancer_df[sectionT]
sectionT_df.value_counts()
liver_eligible_bq ph_liver_trial
1 0 149369
0 9 4993
0 507
1 18
Name: count, dtype: int64
We can observe, how in our variable "liver_eligible_bq", the ones that indicates if a participant in our screening questionnaire has a valid data or not, 5518 cases of people whose formulary was rejected. One of the causes of rejection is had a history of cancer prior to the trial.
I am going to delete those patients, (rows), from our dataset.
liver_cancer_df_with_bq = liver_cancer_df.query('liver_eligible_bq != 0')
liver_cancer_df_with_bq
| liver_topography | liver_morphology | liver_grade | liver_behavior | liver_cancer_first | liver_cancer | liver_seer | liver_seercat | liver_annyr | liver_exitstat | liver_exitage | liver_exitdays | liver_cancer_diagdays | plco_id | build | build_cancers | build_incidence_cutoff | educat | marital | occupat | pipe | cigar | sisters | brothers | asp | ibup | fmenstr | menstrs | miscar | tubal | tuballig | bbd | benign_ovcyst | endometriosis | uterine_fib | bq_adminm | lmenstr | trypreg | prega | pregc | stillb | livec | fchilda | hystera | asppd | ibuppd | bcontra | bcontrt | curhorm | thorm | urinatea | enlprosa | infprosa | vasecta | hyperten_f | hearta_f | stroke_f | emphys_f | bronchit_f | diabetes_f | polyps_f | arthrit_f | osteopor_f | divertic_f | gallblad_f | bq_returned | bq_age | race7 | hispanic_f | surg_biopsy | surg_resection | surg_prostatectomy | surg_age | surg_any | preg_f | hyster_f | ovariesr_f | enlpros_f | infpros_f | prosprob_f | urinate_f | vasect_f | bcontr_f | horm_f | horm_stat | smoked_f | smokea_f | rsmoker_f | ssmokea_f | cigpd_f | filtered_f | cig_stat | cig_stop | cig_years | pack_years | bmi_20 | bmi_50 | bmi_curr | bmi_curc | weight_f | weight20_f | weight50_f | height_f | menstrs_stat_type | post_menopausal | bmi_20c | bmi_50c | colon_comorbidity | liver_comorbidity | fh_cancer | liver_fh | liver_fh_cnt | liver_fh_age | bq_compdays | d_dth_liver | f_dth_liver | d_codeath_cat | f_codeath_cat | d_cancersite | f_cancersite | d_seer_death | f_seer_death | is_dead_with_cod | is_dead | mortality_exitage | mortality_exitstat | build_death_cutoff | dth_days | mortality_exitdays | entryage_bq | entryage_dqx | entryage_dhq | entryage_sqx | entryage_muq | ph_any_bq | ph_any_dqx | ph_any_dhq | ph_any_sqx | ph_any_muq | ph_liver_bq | ph_liver_dqx | ph_liver_dhq | ph_liver_sqx | ph_liver_muq | ph_any_trial | ph_liver_trial | liver_eligible_bq | liver_eligible_sqx | liver_eligible_dhq | liver_eligible_dqx | entrydays_bq | entrydays_dqx | entrydays_dhq | entrydays_sqx | entrydays_muq | center | rndyear | arm | sex | age | agelevel | reconsent_outcome | reconsent_outcome_days | fstcan_exitstat | fstcan_exitage | fstcan_exitdays | in_TGWAS_population | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | 8 | 80 | 4794 | NaN | A-000899-7 | mar22/03.22.22 | 1 | 1 | 2.0 | 3.0 | 4.0 | 0.0 | 0.0 | 3.0 | 3.0 | 1.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 4.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 67.0 | 2 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | 0.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | 26.383937 | 29.022331 | 25.724339 | 3.0 | 195.0 | 200.0 | 220.0 | 73.0 | NaN | NaN | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 9.0 | 0.0 | NaN | -25.0 | 0 | 0 | 900.0 | 900.0 | 999.0 | 999.0 | 50051.0 | 50051.0 | 1 | 1 | 88 | 1 | 4 | 7939.0 | 7939 | 67.0 | NaN | 70.0 | 76.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | NaN | 0 | 0 | 1 | 1 | 1 | 0 | 0.0 | NaN | 1067.0 | 3457.0 | NaN | 4 | 1996 | 1 | 1 | 67 | 2 | 2 | 5336 | 8 | 80 | 4794 | 1 |
| 1 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | 8 | 72 | 3873 | NaN | A-000989-6 | mar22/03.22.22 | 1 | 1 | 7.0 | 1.0 | 2.0 | 2.0 | 2.0 | 0.0 | 2.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN | 4.0 | NaN | NaN | NaN | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 | 62.0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | NaN | NaN | NaN | 1.0 | 21.0 | 0.0 | 29.0 | 5.0 | 1.0 | 2.0 | 33.0 | 8.0 | 24.0 | 22.313033 | 27.891291 | 25.659988 | 3.0 | 184.0 | 160.0 | 200.0 | 71.0 | NaN | NaN | 2.0 | 3.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | NaN | -7.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | 81 | 2 | 4 | NaN | 7160 | 62.0 | 62.0 | 65.0 | NaN | 75.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0 | 0 | 1 | 0 | 1 | 1 | 0.0 | 12.0 | 1063.0 | NaN | 5018.0 | 6 | 1999 | 1 | 1 | 62 | 1 | 1 | 4759 | 8 | 72 | 3873 | 1 |
| 2 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | 8 | 74 | 4123 | NaN | A-000998-7 | mar22/03.22.22 | 1 | 1 | 5.0 | 1.0 | 4.0 | 0.0 | 0.0 | 1.0 | 2.0 | 1.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 0.0 | NaN | NaN | NaN | NaN | 4.0 | NaN | NaN | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1 | 63.0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 2.0 | 1.0 | NaN | NaN | NaN | 1.0 | 15.0 | 0.0 | 50.0 | 2.0 | 2.0 | 2.0 | 13.0 | 35.0 | 35.0 | 30.680421 | 32.074985 | 34.585201 | 4.0 | 248.0 | 220.0 | 230.0 | 71.0 | NaN | NaN | 4.0 | 4.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | NaN | 15.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | 83 | 2 | 4 | NaN | 7410 | 63.0 | NaN | 64.0 | 70.0 | 77.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1 | 1 | 0 | 15.0 | NaN | 702.0 | 2774.0 | 5317.0 | 10 | 1998 | 2 | 1 | 62 | 1 | 1 | 4658 | 8 | 74 | 4123 | 0 |
| 3 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | 8 | 87 | 4672 | NaN | A-001799-8 | mar22/03.22.22 | 1 | 1 | 5.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 3.0 | 0.0 | 3.0 | 4.0 | 0.0 | 4.0 | 3.0 | 3.0 | 0.0 | 0.0 | NaN | 0.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 74.0 | 1 | 0.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | 0.0 | 1.0 | 1.0 | 0.0 | NaN | NaN | NaN | 0.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | 20.595703 | 22.312012 | 22.312012 | 2.0 | 130.0 | 120.0 | 130.0 | 64.0 | 3.0 | 1.0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | -31.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | 90 | 3 | 4 | NaN | 5621 | 74.0 | NaN | 77.0 | 83.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | NaN | 0 | 0 | 1 | 1 | 0 | 0 | 0.0 | NaN | 1064.0 | 3359.0 | NaN | 6 | 1997 | 2 | 2 | 74 | 3 | 5 | 5621 | 8 | 87 | 4672 | 1 |
| 4 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | 8 | 72 | 3386 | NaN | A-001889-7 | mar22/03.22.22 | 1 | 1 | 6.0 | 3.0 | 2.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 0.0 | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | NaN | 0.0 | 0.0 | 1.0 | 2.0 | 1.0 | 2.0 | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1 | 63.0 | 1 | 0.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 18.0 | 0.0 | 38.0 | 1.0 | 1.0 | 2.0 | 25.0 | 20.0 | 10.0 | 18.879395 | 24.886475 | 27.460938 | 3.0 | 160.0 | 110.0 | 145.0 | 64.0 | 1.0 | 1.0 | 2.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | NaN | 20.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | 81 | 2 | 4 | NaN | 6673 | 63.0 | NaN | 63.0 | 69.0 | 75.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1 | 1 | 0 | 20.0 | NaN | 237.0 | 2322.0 | 4601.0 | 5 | 2000 | 2 | 2 | 63 | 1 | 1 | 4106 | 8 | 72 | 3386 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 154882 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | 8 | 69 | 4207 | NaN | Z-162295-2 | mar22/03.22.22 | 1 | 1 | 6.0 | 5.0 | 7.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 58.0 | 4 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | 0.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | 25.107143 | 26.541837 | 29.411224 | 3.0 | 205.0 | 175.0 | 185.0 | 70.0 | NaN | NaN | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | NaN | -13.0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN | 0 | 0 | 78 | 2 | 4 | NaN | 7494 | 58.0 | NaN | 60.0 | 66.0 | 72.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0 | 0 | 1 | 1 | 1 | 0 | 0.0 | NaN | 706.0 | 2855.0 | 5351.0 | 3 | 1998 | 2 | 1 | 58 | 0 | 1 | 4767 | 8 | 69 | 4207 | 1 |
| 154883 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | 8 | 77 | 5539 | NaN | Z-162349-7 | mar22/03.22.22 | 1 | 1 | 6.0 | 1.0 | 4.0 | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN | 4.0 | 4.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1 | 62.0 | 1 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 1.0 | NaN | NaN | NaN | 1.0 | 0.0 | 1.0 | 2.0 | 0.0 | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | 0.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | 20.524438 | 26.605753 | 28.886246 | 3.0 | 190.0 | 135.0 | 175.0 | 68.0 | NaN | NaN | 2.0 | 3.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | NaN | 0.0 | 0 | 0 | 1000.0 | 1000.0 | 999.0 | 999.0 | 60000.0 | 60000.0 | 1 | 1 | 81 | 1 | 4 | 7099.0 | 7099 | 62.0 | NaN | 68.0 | 74.0 | NaN | 0.0 | NaN | 0.0 | 1.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | NaN | 0 | 0 | 1 | 1 | 1 | 0 | 0.0 | NaN | 2113.0 | 4194.0 | NaN | 9 | 1994 | 2 | 1 | 62 | 1 | 3 | 6054 | 1 | 73 | 3994 | 1 |
| 154884 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | 5 | 78 | 2152 | NaN | Z-162358-8 | mar22/03.22.22 | 1 | 1 | 5.0 | 1.0 | 4.0 | 0.0 | 0.0 | 2.0 | 6.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN | 4.0 | 5.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1 | 72.0 | 1 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | NaN | 1.0 | 0.0 | 1.0 | 3.0 | 0.0 | NaN | NaN | NaN | 1.0 | 16.0 | 0.0 | 58.0 | 3.0 | 2.0 | 2.0 | 14.0 | 42.0 | 63.0 | 21.520408 | 30.128571 | 33.715306 | 4.0 | 235.0 | 150.0 | 210.0 | 70.0 | NaN | NaN | 2.0 | 4.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | NaN | -8.0 | 0 | 0 | 200.0 | 200.0 | 999.0 | 999.0 | 50060.0 | 50060.0 | 1 | 1 | 78 | 1 | 4 | 2152.0 | 2152 | 72.0 | NaN | 74.0 | NaN | NaN | 0.0 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | 0.0 | NaN | NaN | 0 | 0 | 1 | 0 | 1 | 0 | 0.0 | NaN | 711.0 | NaN | NaN | 4 | 1998 | 2 | 1 | 72 | 3 | 12 | 2152 | 5 | 78 | 2152 | 0 |
| 154885 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | 5 | 75 | 4524 | NaN | Z-162367-9 | mar22/03.22.22 | 1 | 1 | 3.0 | 1.0 | 4.0 | 0.0 | 0.0 | 1.0 | 4.0 | 1.0 | 1.0 | 5.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | NaN | 0.0 | 1.0 | 1.0 | 2.0 | 0.0 | 3.0 | 3.0 | 0.0 | 4.0 | 3.0 | 2.0 | 1.0 | 2.0 | 2.0 | 3.0 | 1.0 | 1.0 | NaN | NaN | NaN | NaN | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1 | 62.0 | 1 | 0.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | 1.0 | 1.0 | 1.0 | 20.0 | 1.0 | NaN | 2.0 | 1.0 | 1.0 | 0.0 | 42.0 | 42.0 | 20.985075 | 25.839831 | 27.405881 | 3.0 | 175.0 | 134.0 | 165.0 | 67.0 | 3.0 | 1.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | -9.0 | 0 | 0 | 100.0 | 100.0 | 14.0 | 14.0 | 14.0 | 14.0 | 1 | 1 | 75 | 1 | 4 | 4524.0 | 4524 | 62.0 | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | 0.0 | NaN | NaN | NaN | NaN | 0 | 0 | 1 | 0 | 0 | 0 | 0.0 | NaN | NaN | NaN | NaN | 4 | 1997 | 2 | 2 | 62 | 1 | 12 | 4524 | 1 | 63 | 250 | 0 |
| 154886 | NaN | NaN | NaN | NaN | NaN | 0 | NaN | NaN | NaN | 5 | 62 | 1579 | NaN | Z-162376-0 | mar22/03.22.22 | 1 | 1 | 3.0 | 1.0 | 6.0 | 0.0 | 0.0 | 5.0 | 2.0 | 0.0 | 1.0 | 3.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 3.0 | 0.0 | 3.0 | 5.0 | 0.0 | 5.0 | 3.0 | NaN | 0.0 | 7.0 | NaN | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 57.0 | 1 | 0.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | 0.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | 22.655273 | 25.401367 | 25.229736 | 3.0 | 147.0 | 132.0 | 148.0 | 64.0 | 1.0 | 1.0 | 2.0 | 3.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | NaN | -40.0 | 0 | 0 | 900.0 | 900.0 | 999.0 | 999.0 | 50300.0 | 50300.0 | 1 | 1 | 62 | 1 | 4 | 1579.0 | 1579 | 57.0 | NaN | 61.0 | NaN | NaN | 0.0 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | 0.0 | NaN | NaN | 0 | 0 | 1 | 0 | 1 | 0 | 0.0 | NaN | 1409.0 | NaN | NaN | 4 | 1996 | 2 | 2 | 57 | 0 | 12 | 1579 | 5 | 62 | 1579 | 0 |
149369 rows × 167 columns
From now on, I will use the dataset without those rows, called 'liver_cancer_df_with_bq'. However, any of the variables inside this section will be considered for my final model.
Section 4: DHQ Eligibility¶
I am only consider the data that comes from the BQ formulary. In that formulary we have the variables that we need, so I won't take these variables in my model.
Section 5: DQX Eligibility¶
I am only consider the data that comes from the BQ formulary. In that formulary we have the variables that we need, so I won't take these variables in my model.
Section 6: SQX Eligibility¶
I am only consider the data that comes from the BQ formulary. In that formulary we have the variables that we need, so I won't take these variables in my model.
Section 7: MUQ Eligibility¶
I am only consider the data that comes from the BQ formulary. In that formulary we have the variables that we need, so I won't take these variables in my model.
Section 8: Exit¶
section8_df = liver_cancer_df_with_bq[section8]
section8_df
| fstcan_exitstat | liver_exitstat | fstcan_exitdays | liver_exitdays | fstcan_exitage | liver_exitage | mortality_exitstat | mortality_exitdays | mortality_exitage | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 8 | 4794 | 4794 | 80 | 80 | 1 | 7939 | 88 |
| 1 | 8 | 8 | 3873 | 3873 | 72 | 72 | 2 | 7160 | 81 |
| 2 | 8 | 8 | 4123 | 4123 | 74 | 74 | 2 | 7410 | 83 |
| 3 | 8 | 8 | 4672 | 4672 | 87 | 87 | 3 | 5621 | 90 |
| 4 | 8 | 8 | 3386 | 3386 | 72 | 72 | 2 | 6673 | 81 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 154882 | 8 | 8 | 4207 | 4207 | 69 | 69 | 2 | 7494 | 78 |
| 154883 | 1 | 8 | 3994 | 5539 | 73 | 77 | 1 | 7099 | 81 |
| 154884 | 5 | 5 | 2152 | 2152 | 78 | 78 | 1 | 2152 | 78 |
| 154885 | 1 | 5 | 250 | 4524 | 63 | 75 | 1 | 4524 | 75 |
| 154886 | 5 | 5 | 1579 | 1579 | 62 | 62 | 1 | 1579 | 62 |
149369 rows × 9 columns
daf.nulls_percentage(section8_df)
fstcan_exitstat , 0.0% nulls , 8 unique values, int64 liver_exitstat , 0.0% nulls , 8 unique values, int64 fstcan_exitdays , 0.0% nulls , 5903 unique values, int64 liver_exitdays , 0.0% nulls , 5823 unique values, int64 fstcan_exitage , 0.0% nulls , 38 unique values, int64 liver_exitage , 0.0% nulls , 37 unique values, int64 mortality_exitstat , 0.0% nulls , 4 unique values, int64 mortality_exitdays , 0.0% nulls , 9050 unique values, int64 mortality_exitage , 0.0% nulls , 45 unique values, int64
In this section, since my study is about the liver cancer, I am not going to consider the first cancer incidence variables. Then, I will delete from this section the paramaters 'fstcan_exitstat', 'fstcan_exitdays', 'fstcan_exitage'.
section8_df = section8_df.drop(['fstcan_exitstat', 'fstcan_exitdays', 'fstcan_exitage'], axis=1)
section8_df
| liver_exitstat | liver_exitdays | liver_exitage | mortality_exitstat | mortality_exitdays | mortality_exitage | |
|---|---|---|---|---|---|---|
| 0 | 8 | 4794 | 80 | 1 | 7939 | 88 |
| 1 | 8 | 3873 | 72 | 2 | 7160 | 81 |
| 2 | 8 | 4123 | 74 | 2 | 7410 | 83 |
| 3 | 8 | 4672 | 87 | 3 | 5621 | 90 |
| 4 | 8 | 3386 | 72 | 2 | 6673 | 81 |
| ... | ... | ... | ... | ... | ... | ... |
| 154882 | 8 | 4207 | 69 | 2 | 7494 | 78 |
| 154883 | 8 | 5539 | 77 | 1 | 7099 | 81 |
| 154884 | 5 | 2152 | 78 | 1 | 2152 | 78 |
| 154885 | 5 | 4524 | 75 | 1 | 4524 | 75 |
| 154886 | 5 | 1579 | 62 | 1 | 1579 | 62 |
149369 rows × 6 columns
to_remove = ['fstcan_exitstat', 'fstcan_exitdays', 'fstcan_exitage']
section8 = [item for item in section8 if item not in to_remove]
print(section8)
['liver_exitstat', 'liver_exitdays', 'liver_exitage', 'mortality_exitstat', 'mortality_exitdays', 'mortality_exitage']
section8_df.hist(figsize=(20,20), bins = 30, xrot=-45 ,)
array([[<Axes: title={'center': 'liver_exitstat'}>,
<Axes: title={'center': 'liver_exitdays'}>],
[<Axes: title={'center': 'liver_exitage'}>,
<Axes: title={'center': 'mortality_exitstat'}>],
[<Axes: title={'center': 'mortality_exitdays'}>,
<Axes: title={'center': 'mortality_exitage'}>]], dtype=object)
plt.figure(figsize=(20,20))
sns.heatmap(liver_cancer_df_with_bq[section8].corr(), cmap='RdBu_r', annot = True)
<Axes: >
to_remove = ['liver_exitdays', 'mortality_exitdays']
section8 = [item for item in section8 if item not in to_remove]
print(section8)
['liver_exitstat', 'liver_exitage', 'mortality_exitstat', 'mortality_exitage']
plt.figure(figsize=(20,20))
sns.heatmap(liver_cancer_df_with_bq[section8].corr(), cmap='RdBu_r', annot = True)
<Axes: >
Section 9: Demographics at Trial Entry¶
section9_df = liver_cancer_df_with_bq[section9]
section9_df
| age | agelevel | arm | center | rndyear | sex | |
|---|---|---|---|---|---|---|
| 0 | 67 | 2 | 1 | 4 | 1996 | 1 |
| 1 | 62 | 1 | 1 | 6 | 1999 | 1 |
| 2 | 62 | 1 | 2 | 10 | 1998 | 1 |
| 3 | 74 | 3 | 2 | 6 | 1997 | 2 |
| 4 | 63 | 1 | 2 | 5 | 2000 | 2 |
| ... | ... | ... | ... | ... | ... | ... |
| 154882 | 58 | 0 | 2 | 3 | 1998 | 1 |
| 154883 | 62 | 1 | 2 | 9 | 1994 | 1 |
| 154884 | 72 | 3 | 2 | 4 | 1998 | 1 |
| 154885 | 62 | 1 | 2 | 4 | 1997 | 2 |
| 154886 | 57 | 0 | 2 | 4 | 1996 | 2 |
149369 rows × 6 columns
daf.nulls_percentage(section9_df)
age , 0.0% nulls , 30 unique values, int64 agelevel , 0.0% nulls , 4 unique values, int64 arm , 0.0% nulls , 2 unique values, int64 center , 0.0% nulls , 10 unique values, int64 rndyear , 0.0% nulls , 9 unique values, int64 sex , 0.0% nulls , 2 unique values, int64
section9_df.hist(figsize=(20,20), bins = 30, xrot=-45)
array([[<Axes: title={'center': 'age'}>,
<Axes: title={'center': 'agelevel'}>],
[<Axes: title={'center': 'arm'}>,
<Axes: title={'center': 'center'}>],
[<Axes: title={'center': 'rndyear'}>,
<Axes: title={'center': 'sex'}>]], dtype=object)
plt.figure(figsize=(20,20))
sns.heatmap(liver_cancer_df_with_bq[section9].corr(), cmap='RdBu_r', annot = True)
<Axes: >
We can observe the variables 'age' and 'agelevel', are highly correlated. Which makes sense, because both represent the age at trial entry, but agelevel is created from different categories. For our model I am going to consider only 'agelevel', because we have less different values.
Section 10: Re-consent¶
section10_df = liver_cancer_df_with_bq[section10]
section10_df
| reconsent_outcome | reconsent_outcome_days | |
|---|---|---|
| 0 | 2 | 5336 |
| 1 | 1 | 4759 |
| 2 | 1 | 4658 |
| 3 | 5 | 5621 |
| 4 | 1 | 4106 |
| ... | ... | ... |
| 154882 | 1 | 4767 |
| 154883 | 3 | 6054 |
| 154884 | 12 | 2152 |
| 154885 | 12 | 4524 |
| 154886 | 12 | 1579 |
149369 rows × 2 columns
daf.nulls_percentage(section10_df)
reconsent_outcome , 0.0% nulls , 10 unique values, int64 reconsent_outcome_days , 0.0% nulls , 6649 unique values, int64
section10_df.hist(figsize=(20,10), bins = 30, xrot=-45)
array([[<Axes: title={'center': 'reconsent_outcome'}>,
<Axes: title={'center': 'reconsent_outcome_days'}>]], dtype=object)
These variables are not going to be considered to the model, because they don't give information more than information about consent to centralized follow-up. However, the important data, such as, liver cancer, cause of death, etc. are already present in our study.
Section 11: Cancer Diagnosis¶
section11_df = liver_cancer_df_with_bq[section11]
section11_df
| liver_cancer | liver_cancer_diagdays | liver_cancer_first | liver_annyr | |
|---|---|---|---|---|
| 0 | 0 | NaN | NaN | NaN |
| 1 | 0 | NaN | NaN | NaN |
| 2 | 0 | NaN | NaN | NaN |
| 3 | 0 | NaN | NaN | NaN |
| 4 | 0 | NaN | NaN | NaN |
| ... | ... | ... | ... | ... |
| 154882 | 0 | NaN | NaN | NaN |
| 154883 | 0 | NaN | NaN | NaN |
| 154884 | 0 | NaN | NaN | NaN |
| 154885 | 0 | NaN | NaN | NaN |
| 154886 | 0 | NaN | NaN | NaN |
149369 rows × 4 columns
daf.nulls_percentage(section11_df)
liver_cancer , 0.0% nulls , 2 unique values, int64 liver_cancer_diagdays , 99.9% nulls , 218 unique values, float64 liver_cancer_first , 99.9% nulls , 2 unique values, float64 liver_annyr , 99.9% nulls , 15 unique values, float64
section11_df.hist(figsize=(20,20), bins = 30, xrot=-45)
array([[<Axes: title={'center': 'liver_cancer'}>,
<Axes: title={'center': 'liver_cancer_diagdays'}>],
[<Axes: title={'center': 'liver_cancer_first'}>,
<Axes: title={'center': 'liver_annyr'}>]], dtype=object)
plt.figure(figsize=(20,20))
sns.heatmap(liver_cancer_df_with_bq[section11].corr(), cmap='RdBu_r', annot = True)
<Axes: >
In this section we have the variables about the cancer diagnosis.
First, I have 'liver_cancer' which will be my target variable. I will study separately later in deep, but I will mention for now how unbalanced is this, so I will solve this problem because has a huge impact on our study.
Second, we can see how correlated are variables 'liver_cancer_diagdays' and 'liver_annyr', both representing information about how long the cancer was diagnosed. For my model, since I am going to study the appearance of liver cancer, I won't take into consideration how long the patient suffered cancer, so I will delete both from my characteristics.
Section 12: Cancer Characteristics¶
section12_df = liver_cancer_df_with_bq[section12]
section12_df
| liver_behavior | liver_grade | liver_morphology | liver_topography | liver_seer | liver_seercat | |
|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| 154882 | NaN | NaN | NaN | NaN | NaN | NaN |
| 154883 | NaN | NaN | NaN | NaN | NaN | NaN |
| 154884 | NaN | NaN | NaN | NaN | NaN | NaN |
| 154885 | NaN | NaN | NaN | NaN | NaN | NaN |
| 154886 | NaN | NaN | NaN | NaN | NaN | NaN |
149369 rows × 6 columns
section12_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 149369 entries, 0 to 154886 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 liver_behavior 221 non-null float64 1 liver_grade 221 non-null float64 2 liver_morphology 221 non-null float64 3 liver_topography 221 non-null object 4 liver_seer 221 non-null float64 5 liver_seercat 221 non-null float64 dtypes: float64(5), object(1) memory usage: 8.0+ MB
daf.nulls_percentage(section12_df)
liver_behavior , 99.9% nulls , 1 unique values, float64 liver_grade , 99.9% nulls , 5 unique values, float64 liver_morphology , 99.9% nulls , 10 unique values, float64 liver_topography , 99.9% nulls , 2 unique values, object liver_seer , 99.9% nulls , 2 unique values, float64 liver_seercat , 99.9% nulls , 1 unique values, float64
section12_df.hist(figsize=(20,20), bins = 30, xrot=-45)
array([[<Axes: title={'center': 'liver_behavior'}>,
<Axes: title={'center': 'liver_grade'}>],
[<Axes: title={'center': 'liver_morphology'}>,
<Axes: title={'center': 'liver_seer'}>],
[<Axes: title={'center': 'liver_seercat'}>, <Axes: >]],
dtype=object)
Here we have the data about the cancer characteristics. For my initial study, I am not going to take into consideration these variables. The main reason, is because it is not the purpose of my study, which is, the existence of a liver cancer or not and death incidence.
Also, after reviewing those parameters and their characteristics, we can observe how, some of them, have an unique value, or the number of null values is very high, due to a low number of positive cases in my dataset.
Section 13: Mortality status¶
I am not going to study death or not, only the development of the disease.
section13_df = liver_cancer_df_with_bq[section13]
section13_df
| is_dead | is_dead_with_cod | dth_days | |
|---|---|---|---|
| 0 | 1 | 1 | 7939.0 |
| 1 | 0 | 0 | NaN |
| 2 | 0 | 0 | NaN |
| 3 | 0 | 0 | NaN |
| 4 | 0 | 0 | NaN |
| ... | ... | ... | ... |
| 154882 | 0 | 0 | NaN |
| 154883 | 1 | 1 | 7099.0 |
| 154884 | 1 | 1 | 2152.0 |
| 154885 | 1 | 1 | 4524.0 |
| 154886 | 1 | 1 | 1579.0 |
149369 rows × 3 columns
section13_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 149369 entries, 0 to 154886 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 is_dead 149369 non-null int64 1 is_dead_with_cod 149369 non-null int64 2 dth_days 55867 non-null float64 dtypes: float64(1), int64(2) memory usage: 4.6 MB
daf.nulls_percentage(section13_df)
is_dead , 0.0% nulls , 2 unique values, int64 is_dead_with_cod , 0.0% nulls , 2 unique values, int64 dth_days , 62.6% nulls , 8776 unique values, float64
section13_df.hist(figsize=(20,20), bins = 30, xrot=-45)
array([[<Axes: title={'center': 'is_dead'}>,
<Axes: title={'center': 'is_dead_with_cod'}>],
[<Axes: title={'center': 'dth_days'}>, <Axes: >]], dtype=object)
plt.figure(figsize=(20,20))
sns.heatmap(liver_cancer_df_with_bq[section13].corr(), cmap='RdBu_r', annot = True)
<Axes: >
In this section about the mortality status, we have two variables high correlated and one with many different values with not an important information. For those reason the only one that I am gonna use for my model, will be 'is_dead_with_cod'.
Section 14: Death Certificate Cause of Death¶
section14_df = liver_cancer_df_with_bq[section14]
section14_df
| d_seer_death | d_cancersite | d_dth_liver | d_codeath_cat | |
|---|---|---|---|---|
| 0 | 50051.0 | 999.0 | 0 | 900.0 |
| 1 | NaN | NaN | 0 | NaN |
| 2 | NaN | NaN | 0 | NaN |
| 3 | NaN | NaN | 0 | NaN |
| 4 | NaN | NaN | 0 | NaN |
| ... | ... | ... | ... | ... |
| 154882 | NaN | NaN | 0 | NaN |
| 154883 | 60000.0 | 999.0 | 0 | 1000.0 |
| 154884 | 50060.0 | 999.0 | 0 | 200.0 |
| 154885 | 14.0 | 14.0 | 0 | 100.0 |
| 154886 | 50300.0 | 999.0 | 0 | 900.0 |
149369 rows × 4 columns
section14_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 149369 entries, 0 to 154886 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 d_seer_death 55846 non-null float64 1 d_cancersite 55846 non-null float64 2 d_dth_liver 149369 non-null int64 3 d_codeath_cat 55846 non-null float64 dtypes: float64(3), int64(1) memory usage: 5.7 MB
daf.nulls_percentage(section14_df)
d_seer_death , 62.6% nulls , 77 unique values, float64 d_cancersite , 62.6% nulls , 20 unique values, float64 d_dth_liver , 0.0% nulls , 2 unique values, int64 d_codeath_cat , 62.6% nulls , 17 unique values, float64
section14_df.hist(figsize=(20,20), bins = 30, xrot=-45)
array([[<Axes: title={'center': 'd_seer_death'}>,
<Axes: title={'center': 'd_cancersite'}>],
[<Axes: title={'center': 'd_dth_liver'}>,
<Axes: title={'center': 'd_codeath_cat'}>]], dtype=object)
plt.figure(figsize=(20,20))
sns.heatmap(liver_cancer_df_with_bq[section14].corr(), cmap='RdBu_r', annot = True)
<Axes: >
Section 15: Final Cause of Death¶
section15_df = liver_cancer_df_with_bq[section15]
section15_df
| f_seer_death | f_cancersite | f_dth_liver | f_codeath_cat | |
|---|---|---|---|---|
| 0 | 50051.0 | 999.0 | 0 | 900.0 |
| 1 | NaN | NaN | 0 | NaN |
| 2 | NaN | NaN | 0 | NaN |
| 3 | NaN | NaN | 0 | NaN |
| 4 | NaN | NaN | 0 | NaN |
| ... | ... | ... | ... | ... |
| 154882 | NaN | NaN | 0 | NaN |
| 154883 | 60000.0 | 999.0 | 0 | 1000.0 |
| 154884 | 50060.0 | 999.0 | 0 | 200.0 |
| 154885 | 14.0 | 14.0 | 0 | 100.0 |
| 154886 | 50300.0 | 999.0 | 0 | 900.0 |
149369 rows × 4 columns
section15_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 149369 entries, 0 to 154886 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 f_seer_death 55845 non-null float64 1 f_cancersite 55845 non-null float64 2 f_dth_liver 149369 non-null int64 3 f_codeath_cat 55845 non-null float64 dtypes: float64(3), int64(1) memory usage: 5.7 MB
daf.nulls_percentage(section15_df)
f_seer_death , 62.6% nulls , 77 unique values, float64 f_cancersite , 62.6% nulls , 20 unique values, float64 f_dth_liver , 0.0% nulls , 2 unique values, int64 f_codeath_cat , 62.6% nulls , 17 unique values, float64
section15_df.hist(figsize=(20,20), bins = 30, xrot=-45)
array([[<Axes: title={'center': 'f_seer_death'}>,
<Axes: title={'center': 'f_cancersite'}>],
[<Axes: title={'center': 'f_dth_liver'}>,
<Axes: title={'center': 'f_codeath_cat'}>]], dtype=object)
plt.figure(figsize=(20,20))
sns.heatmap(liver_cancer_df_with_bq[section15].corr(), cmap='RdBu_r', annot = True)
<Axes: >
Section 16: BQ Compliance¶
No voy a considerar ninguna accion con el formulario.
Section 17: BQ Demographics¶
section17_df = liver_cancer_df_with_bq[section17]
section17_df
| race7 | hispanic_f | educat | marital | occupat | |
|---|---|---|---|---|---|
| 0 | 2 | 0.0 | 2.0 | 3.0 | 4.0 |
| 1 | 1 | 0.0 | 7.0 | 1.0 | 2.0 |
| 2 | 1 | 0.0 | 5.0 | 1.0 | 4.0 |
| 3 | 1 | 0.0 | 5.0 | 1.0 | 1.0 |
| 4 | 1 | 0.0 | 6.0 | 3.0 | 2.0 |
| ... | ... | ... | ... | ... | ... |
| 154882 | 4 | 0.0 | 6.0 | 5.0 | 7.0 |
| 154883 | 1 | 0.0 | 6.0 | 1.0 | 4.0 |
| 154884 | 1 | 0.0 | 5.0 | 1.0 | 4.0 |
| 154885 | 1 | 0.0 | 3.0 | 1.0 | 4.0 |
| 154886 | 1 | 0.0 | 3.0 | 1.0 | 6.0 |
149369 rows × 5 columns
section17_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 149369 entries, 0 to 154886 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 race7 149369 non-null int64 1 hispanic_f 145494 non-null float64 2 educat 148972 non-null float64 3 marital 148999 non-null float64 4 occupat 148620 non-null float64 dtypes: float64(4), int64(1) memory usage: 6.8 MB
daf.nulls_percentage(section17_df)
race7 , 0.0% nulls , 7 unique values, int64 hispanic_f , 2.6% nulls , 2 unique values, float64 educat , 0.3% nulls , 7 unique values, float64 marital , 0.2% nulls , 5 unique values, float64 occupat , 0.5% nulls , 7 unique values, float64
section17_df.hist(figsize=(20,20), bins = 30, xrot=-45)
array([[<Axes: title={'center': 'race7'}>,
<Axes: title={'center': 'hispanic_f'}>],
[<Axes: title={'center': 'educat'}>,
<Axes: title={'center': 'marital'}>],
[<Axes: title={'center': 'occupat'}>, <Axes: >]], dtype=object)
plt.figure(figsize=(20,20))
sns.heatmap(liver_cancer_df_with_bq[section17].corr(), cmap='RdBu_r', annot = True)
<Axes: >
Section 18: BQ Smoking¶
section18_df = liver_cancer_df_with_bq[section18]
section18_df
| cig_stat | cig_stop | cig_years | cigpd_f | pack_years | cigar | filtered_f | pipe | rsmoker_f | smokea_f | smoked_f | ssmokea_f | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | 0.0 | NaN |
| 1 | 2.0 | 33.0 | 8.0 | 5.0 | 24.0 | 2.0 | 1.0 | 2.0 | 0.0 | 21.0 | 1.0 | 29.0 |
| 2 | 2.0 | 13.0 | 35.0 | 2.0 | 35.0 | 0.0 | 2.0 | 0.0 | 0.0 | 15.0 | 1.0 | 50.0 |
| 3 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | 0.0 | NaN |
| 4 | 2.0 | 25.0 | 20.0 | 1.0 | 10.0 | 0.0 | 1.0 | 0.0 | 0.0 | 18.0 | 1.0 | 38.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 154882 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | 0.0 | NaN |
| 154883 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | 0.0 | NaN |
| 154884 | 2.0 | 14.0 | 42.0 | 3.0 | 63.0 | 0.0 | 2.0 | 0.0 | 0.0 | 16.0 | 1.0 | 58.0 |
| 154885 | 1.0 | 0.0 | 42.0 | 2.0 | 42.0 | 0.0 | 1.0 | 0.0 | 1.0 | 20.0 | 1.0 | NaN |
| 154886 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN | 0.0 | NaN |
149369 rows × 12 columns
section18_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 149369 entries, 0 to 154886 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 cig_stat 149348 non-null float64 1 cig_stop 79101 non-null float64 2 cig_years 147688 non-null float64 3 cigpd_f 149155 non-null float64 4 pack_years 147543 non-null float64 5 cigar 147635 non-null float64 6 filtered_f 80182 non-null float64 7 pipe 147795 non-null float64 8 rsmoker_f 80344 non-null float64 9 smokea_f 79855 non-null float64 10 smoked_f 149350 non-null float64 11 ssmokea_f 63153 non-null float64 dtypes: float64(12) memory usage: 14.8 MB
daf.nulls_percentage(section18_df)
cig_stat , 0.0% nulls , 3 unique values, float64 cig_stop , 47.0% nulls , 63 unique values, float64 cig_years , 1.1% nulls , 66 unique values, float64 cigpd_f , 0.1% nulls , 8 unique values, float64 pack_years , 1.2% nulls , 220 unique values, float64 cigar , 1.2% nulls , 3 unique values, float64 filtered_f , 46.3% nulls , 3 unique values, float64 pipe , 1.1% nulls , 3 unique values, float64 rsmoker_f , 46.2% nulls , 2 unique values, float64 smokea_f , 46.5% nulls , 63 unique values, float64 smoked_f , 0.0% nulls , 2 unique values, float64 ssmokea_f , 57.7% nulls , 67 unique values, float64
section18_df.hist(figsize=(20,20), bins = 30, xrot=-45)
array([[<Axes: title={'center': 'cig_stat'}>,
<Axes: title={'center': 'cig_stop'}>,
<Axes: title={'center': 'cig_years'}>],
[<Axes: title={'center': 'cigpd_f'}>,
<Axes: title={'center': 'pack_years'}>,
<Axes: title={'center': 'cigar'}>],
[<Axes: title={'center': 'filtered_f'}>,
<Axes: title={'center': 'pipe'}>,
<Axes: title={'center': 'rsmoker_f'}>],
[<Axes: title={'center': 'smokea_f'}>,
<Axes: title={'center': 'smoked_f'}>,
<Axes: title={'center': 'ssmokea_f'}>]], dtype=object)
plt.figure(figsize=(20,20))
sns.heatmap(liver_cancer_df_with_bq[section18].corr(), cmap='RdBu_r', annot = True)
<Axes: >
Section 19: BQ Family History¶
section19_df = liver_cancer_df_with_bq[section19]
section19_df
| fh_cancer | liver_fh | liver_fh_age | liver_fh_cnt | brothers | sisters | |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 9.0 | NaN | 0.0 | 3.0 | 3.0 |
| 1 | 1.0 | 0.0 | NaN | 0.0 | 2.0 | 0.0 |
| 2 | 1.0 | 0.0 | NaN | 0.0 | 2.0 | 1.0 |
| 3 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 |
| 4 | 1.0 | 0.0 | NaN | 0.0 | 0.0 | 2.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 154882 | 1.0 | 0.0 | NaN | 0.0 | 0.0 | 1.0 |
| 154883 | 1.0 | 0.0 | NaN | 0.0 | 2.0 | 2.0 |
| 154884 | 1.0 | 0.0 | NaN | 0.0 | 6.0 | 2.0 |
| 154885 | 0.0 | 0.0 | NaN | 0.0 | 4.0 | 1.0 |
| 154886 | 1.0 | 0.0 | NaN | 0.0 | 2.0 | 5.0 |
149369 rows × 6 columns
section19_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 149369 entries, 0 to 154886 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 fh_cancer 148867 non-null float64 1 liver_fh 148210 non-null float64 2 liver_fh_age 2997 non-null float64 3 liver_fh_cnt 148210 non-null float64 4 brothers 148246 non-null float64 5 sisters 147898 non-null float64 dtypes: float64(6) memory usage: 8.0 MB
daf.nulls_percentage(section19_df)
fh_cancer , 0.3% nulls , 2 unique values, float64 liver_fh , 0.8% nulls , 3 unique values, float64 liver_fh_age , 98.0% nulls , 87 unique values, float64 liver_fh_cnt , 0.8% nulls , 4 unique values, float64 brothers , 0.8% nulls , 8 unique values, float64 sisters , 1.0% nulls , 8 unique values, float64
section19_df.hist(figsize=(20,20), bins = 30, xrot=-45)
array([[<Axes: title={'center': 'fh_cancer'}>,
<Axes: title={'center': 'liver_fh'}>],
[<Axes: title={'center': 'liver_fh_age'}>,
<Axes: title={'center': 'liver_fh_cnt'}>],
[<Axes: title={'center': 'brothers'}>,
<Axes: title={'center': 'sisters'}>]], dtype=object)
plt.figure(figsize=(20,20))
sns.heatmap(liver_cancer_df_with_bq[section19].corr(), cmap='RdBu_r', annot = True)
<Axes: >
Section 20: BQ Body Type¶
section20_df = liver_cancer_df_with_bq[section20]
section20_df
| bmi_curc | bmi_curr | height_f | weight_f | bmi_20 | bmi_20c | weight20_f | bmi_50 | bmi_50c | weight50_f | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.0 | 25.724339 | 73.0 | 195.0 | 26.383937 | 3.0 | 200.0 | 29.022331 | 3.0 | 220.0 |
| 1 | 3.0 | 25.659988 | 71.0 | 184.0 | 22.313033 | 2.0 | 160.0 | 27.891291 | 3.0 | 200.0 |
| 2 | 4.0 | 34.585201 | 71.0 | 248.0 | 30.680421 | 4.0 | 220.0 | 32.074985 | 4.0 | 230.0 |
| 3 | 2.0 | 22.312012 | 64.0 | 130.0 | 20.595703 | 2.0 | 120.0 | 22.312012 | 2.0 | 130.0 |
| 4 | 3.0 | 27.460938 | 64.0 | 160.0 | 18.879395 | 2.0 | 110.0 | 24.886475 | 2.0 | 145.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 154882 | 3.0 | 29.411224 | 70.0 | 205.0 | 25.107143 | 3.0 | 175.0 | 26.541837 | 3.0 | 185.0 |
| 154883 | 3.0 | 28.886246 | 68.0 | 190.0 | 20.524438 | 2.0 | 135.0 | 26.605753 | 3.0 | 175.0 |
| 154884 | 4.0 | 33.715306 | 70.0 | 235.0 | 21.520408 | 2.0 | 150.0 | 30.128571 | 4.0 | 210.0 |
| 154885 | 3.0 | 27.405881 | 67.0 | 175.0 | 20.985075 | 2.0 | 134.0 | 25.839831 | 3.0 | 165.0 |
| 154886 | 3.0 | 25.229736 | 64.0 | 147.0 | 22.655273 | 2.0 | 132.0 | 25.401367 | 3.0 | 148.0 |
149369 rows × 10 columns
section20_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 149369 entries, 0 to 154886 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 bmi_curc 147078 non-null float64 1 bmi_curr 147078 non-null float64 2 height_f 148162 non-null float64 3 weight_f 147836 non-null float64 4 bmi_20 146618 non-null float64 5 bmi_20c 146618 non-null float64 6 weight20_f 147359 non-null float64 7 bmi_50 147236 non-null float64 8 bmi_50c 147236 non-null float64 9 weight50_f 147993 non-null float64 dtypes: float64(10) memory usage: 12.5 MB
daf.nulls_percentage(section20_df)
bmi_curc , 1.5% nulls , 4 unique values, float64 bmi_curr , 1.5% nulls , 3882 unique values, float64 height_f , 0.8% nulls , 37 unique values, float64 weight_f , 1.0% nulls , 289 unique values, float64 bmi_20 , 1.8% nulls , 2385 unique values, float64 bmi_20c , 1.8% nulls , 4 unique values, float64 weight20_f , 1.3% nulls , 219 unique values, float64 bmi_50 , 1.4% nulls , 3204 unique values, float64 bmi_50c , 1.4% nulls , 4 unique values, float64 weight50_f , 0.9% nulls , 271 unique values, float64
section20_df.hist(figsize=(20,20), bins = 30, xrot=-45)
array([[<Axes: title={'center': 'bmi_curc'}>,
<Axes: title={'center': 'bmi_curr'}>,
<Axes: title={'center': 'height_f'}>],
[<Axes: title={'center': 'weight_f'}>,
<Axes: title={'center': 'bmi_20'}>,
<Axes: title={'center': 'bmi_20c'}>],
[<Axes: title={'center': 'weight20_f'}>,
<Axes: title={'center': 'bmi_50'}>,
<Axes: title={'center': 'bmi_50c'}>],
[<Axes: title={'center': 'weight50_f'}>, <Axes: >, <Axes: >]],
dtype=object)
plt.figure(figsize=(20,20))
sns.heatmap(liver_cancer_df_with_bq[section20].corr(), cmap='RdBu_r', annot = True)
<Axes: >
Section 21: BQ NSAIDS¶
section21_df = liver_cancer_df_with_bq[section21]
section21_df
| asp | asppd | ibup | ibuppd | |
|---|---|---|---|---|
| 0 | 1.0 | 4.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.0 | 2.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... |
| 154882 | 0.0 | 0.0 | 0.0 | 0.0 |
| 154883 | 0.0 | 0.0 | 0.0 | 0.0 |
| 154884 | 0.0 | 0.0 | 0.0 | 0.0 |
| 154885 | 1.0 | 1.0 | 1.0 | 2.0 |
| 154886 | 0.0 | 0.0 | 1.0 | 7.0 |
149369 rows × 4 columns
section21_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 149369 entries, 0 to 154886 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 asp 148517 non-null float64 1 asppd 148839 non-null float64 2 ibup 148545 non-null float64 3 ibuppd 148365 non-null float64 dtypes: float64(4) memory usage: 5.7 MB
daf.nulls_percentage(section21_df)
asp , 0.6% nulls , 2 unique values, float64 asppd , 0.4% nulls , 8 unique values, float64 ibup , 0.6% nulls , 2 unique values, float64 ibuppd , 0.7% nulls , 8 unique values, float64
section21_df.hist(figsize=(20,20), bins = 30, xrot=-45)
array([[<Axes: title={'center': 'asp'}>,
<Axes: title={'center': 'asppd'}>],
[<Axes: title={'center': 'ibup'}>,
<Axes: title={'center': 'ibuppd'}>]], dtype=object)
plt.figure(figsize=(20,20))
sns.heatmap(liver_cancer_df_with_bq[section21].corr(), cmap='RdBu_r', annot = True)
<Axes: >
Section 22: BQ Diseases¶
section22_df = liver_cancer_df_with_bq[section22]
section22_df
| arthrit_f | bronchit_f | colon_comorbidity | diabetes_f | divertic_f | emphys_f | gallblad_f | hearta_f | hyperten_f | liver_comorbidity | osteopor_f | polyps_f | stroke_f | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 154882 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 154883 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 154884 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 154885 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 154886 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
149369 rows × 13 columns
section22_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 149369 entries, 0 to 154886 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 arthrit_f 148362 non-null float64 1 bronchit_f 148360 non-null float64 2 colon_comorbidity 147815 non-null float64 3 diabetes_f 148442 non-null float64 4 divertic_f 148234 non-null float64 5 emphys_f 148435 non-null float64 6 gallblad_f 148288 non-null float64 7 hearta_f 148397 non-null float64 8 hyperten_f 148494 non-null float64 9 liver_comorbidity 148255 non-null float64 10 osteopor_f 148182 non-null float64 11 polyps_f 148270 non-null float64 12 stroke_f 148441 non-null float64 dtypes: float64(13) memory usage: 16.0 MB
daf.nulls_percentage(section22_df)
arthrit_f , 0.7% nulls , 2 unique values, float64 bronchit_f , 0.7% nulls , 2 unique values, float64 colon_comorbidity , 1.0% nulls , 2 unique values, float64 diabetes_f , 0.6% nulls , 2 unique values, float64 divertic_f , 0.8% nulls , 2 unique values, float64 emphys_f , 0.6% nulls , 2 unique values, float64 gallblad_f , 0.7% nulls , 2 unique values, float64 hearta_f , 0.7% nulls , 2 unique values, float64 hyperten_f , 0.6% nulls , 2 unique values, float64 liver_comorbidity , 0.7% nulls , 2 unique values, float64 osteopor_f , 0.8% nulls , 2 unique values, float64 polyps_f , 0.7% nulls , 2 unique values, float64 stroke_f , 0.6% nulls , 2 unique values, float64
section22_df.hist(figsize=(20,20), bins = 30, xrot=-45)
array([[<Axes: title={'center': 'arthrit_f'}>,
<Axes: title={'center': 'bronchit_f'}>,
<Axes: title={'center': 'colon_comorbidity'}>,
<Axes: title={'center': 'diabetes_f'}>],
[<Axes: title={'center': 'divertic_f'}>,
<Axes: title={'center': 'emphys_f'}>,
<Axes: title={'center': 'gallblad_f'}>,
<Axes: title={'center': 'hearta_f'}>],
[<Axes: title={'center': 'hyperten_f'}>,
<Axes: title={'center': 'liver_comorbidity'}>,
<Axes: title={'center': 'osteopor_f'}>,
<Axes: title={'center': 'polyps_f'}>],
[<Axes: title={'center': 'stroke_f'}>, <Axes: >, <Axes: >,
<Axes: >]], dtype=object)
plt.figure(figsize=(20,20))
sns.heatmap(liver_cancer_df_with_bq[section22].corr(), cmap='RdBu_r', annot = True)
<Axes: >
Section 23: BQ Female Specific¶
section23_df = liver_cancer_df_with_bq[section23]
section23_df
| hyster_f | hystera | ovariesr_f | tuballig | bcontr_f | bcontra | bcontrt | curhorm | horm_f | horm_stat | thorm | fchilda | livec | miscar | preg_f | prega | pregc | stillb | trypreg | tubal | fmenstr | lmenstr | menstrs | menstrs_stat_type | post_menopausal | bbd | benign_ovcyst | endometriosis | uterine_fib | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 4.0 | 2.0 | 1.0 | 3.0 | 4.0 | 0.0 | 0.0 | 0.0 | 3.0 | 3.0 | 2.0 | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | NaN | 0.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 | 3.0 | 1.0 | 1.0 | 3.0 | 3.0 | 0.0 | 0.0 | 0.0 | 3.0 | 4.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 154882 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 154883 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 154884 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 154885 | 1.0 | 2.0 | 0.0 | 1.0 | 1.0 | 2.0 | 3.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 4.0 | 0.0 | 1.0 | 3.0 | 3.0 | 0.0 | 0.0 | 0.0 | 5.0 | 2.0 | 2.0 | 3.0 | 1.0 | 0.0 | NaN | 0.0 | 1.0 |
| 154886 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 5.0 | 1.0 | 1.0 | 3.0 | 5.0 | 0.0 | 0.0 | 0.0 | 3.0 | 3.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
149369 rows × 29 columns
section23_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 149369 entries, 0 to 154886 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 hyster_f 75695 non-null float64 1 hystera 27371 non-null float64 2 ovariesr_f 75788 non-null float64 3 tuballig 75500 non-null float64 4 bcontr_f 75677 non-null float64 5 bcontra 40929 non-null float64 6 bcontrt 75581 non-null float64 7 curhorm 75354 non-null float64 8 horm_f 75681 non-null float64 9 horm_stat 75681 non-null float64 10 thorm 75165 non-null float64 11 fchilda 68408 non-null float64 12 livec 75687 non-null float64 13 miscar 75541 non-null float64 14 preg_f 75790 non-null float64 15 prega 69891 non-null float64 16 pregc 75634 non-null float64 17 stillb 75245 non-null float64 18 trypreg 75475 non-null float64 19 tubal 75486 non-null float64 20 fmenstr 75603 non-null float64 21 lmenstr 75171 non-null float64 22 menstrs 74409 non-null float64 23 menstrs_stat_type 75807 non-null float64 24 post_menopausal 75807 non-null float64 25 bbd 73962 non-null float64 26 benign_ovcyst 72397 non-null float64 27 endometriosis 72098 non-null float64 28 uterine_fib 73066 non-null float64 dtypes: float64(29) memory usage: 34.2 MB
daf.nulls_percentage(section23_df)
hyster_f , 49.3% nulls , 3 unique values, float64 hystera , 81.7% nulls , 5 unique values, float64 ovariesr_f , 49.3% nulls , 7 unique values, float64 tuballig , 49.5% nulls , 3 unique values, float64 bcontr_f , 49.3% nulls , 2 unique values, float64 bcontra , 72.6% nulls , 4 unique values, float64 bcontrt , 49.4% nulls , 6 unique values, float64 curhorm , 49.6% nulls , 2 unique values, float64 horm_f , 49.3% nulls , 3 unique values, float64 horm_stat , 49.3% nulls , 5 unique values, float64 thorm , 49.7% nulls , 6 unique values, float64 fchilda , 54.2% nulls , 7 unique values, float64 livec , 49.3% nulls , 6 unique values, float64 miscar , 49.4% nulls , 3 unique values, float64 preg_f , 49.3% nulls , 3 unique values, float64 prega , 53.2% nulls , 7 unique values, float64 pregc , 49.4% nulls , 6 unique values, float64 stillb , 49.6% nulls , 3 unique values, float64 trypreg , 49.5% nulls , 2 unique values, float64 tubal , 49.5% nulls , 3 unique values, float64 fmenstr , 49.4% nulls , 5 unique values, float64 lmenstr , 49.7% nulls , 5 unique values, float64 menstrs , 50.2% nulls , 4 unique values, float64 menstrs_stat_type , 49.2% nulls , 8 unique values, float64 post_menopausal , 49.2% nulls , 2 unique values, float64 bbd , 50.5% nulls , 2 unique values, float64 benign_ovcyst , 51.5% nulls , 2 unique values, float64 endometriosis , 51.7% nulls , 2 unique values, float64 uterine_fib , 51.1% nulls , 2 unique values, float64
Here we can see how many nulls or missing values we have in this section. One of the problems is, since is a section with characteristics only presented in women, every single male has an NaN by default. I am going to proceed to assing to every male, 0 as default value.
daf.set_gender_characteristics_value(liver_cancer_df_with_bq, section23, 1)
section23_df_with_default_values_for_the_other_gender = liver_cancer_df_with_bq[section23]
daf.nulls_percentage(section23_df_with_default_values_for_the_other_gender)
hyster_f , 0.1% nulls , 3 unique values, float64 hystera , 32.4% nulls , 6 unique values, float64 ovariesr_f , 0.0% nulls , 7 unique values, float64 tuballig , 0.2% nulls , 3 unique values, float64 bcontr_f , 0.1% nulls , 2 unique values, float64 bcontra , 23.4% nulls , 5 unique values, float64 bcontrt , 0.2% nulls , 6 unique values, float64 curhorm , 0.3% nulls , 2 unique values, float64 horm_f , 0.1% nulls , 3 unique values, float64 horm_stat , 0.1% nulls , 5 unique values, float64 thorm , 0.4% nulls , 6 unique values, float64 fchilda , 5.0% nulls , 8 unique values, float64 livec , 0.1% nulls , 6 unique values, float64 miscar , 0.2% nulls , 3 unique values, float64 preg_f , 0.0% nulls , 3 unique values, float64 prega , 4.0% nulls , 8 unique values, float64 pregc , 0.1% nulls , 6 unique values, float64 stillb , 0.4% nulls , 3 unique values, float64 trypreg , 0.2% nulls , 2 unique values, float64 tubal , 0.2% nulls , 3 unique values, float64 fmenstr , 0.1% nulls , 6 unique values, float64 lmenstr , 0.4% nulls , 6 unique values, float64 menstrs , 0.9% nulls , 5 unique values, float64 menstrs_stat_type , 0.0% nulls , 9 unique values, float64 post_menopausal , 0.0% nulls , 3 unique values, float64 bbd , 1.2% nulls , 2 unique values, float64 benign_ovcyst , 2.3% nulls , 2 unique values, float64 endometriosis , 2.5% nulls , 2 unique values, float64 uterine_fib , 1.8% nulls , 2 unique values, float64
We can observe how our percentage of nulls and missings decreased
section23_df_with_default_values_for_the_other_gender.hist(figsize=(20,20), bins = 30, xrot=-45)
array([[<Axes: title={'center': 'hyster_f'}>,
<Axes: title={'center': 'hystera'}>,
<Axes: title={'center': 'ovariesr_f'}>,
<Axes: title={'center': 'tuballig'}>,
<Axes: title={'center': 'bcontr_f'}>],
[<Axes: title={'center': 'bcontra'}>,
<Axes: title={'center': 'bcontrt'}>,
<Axes: title={'center': 'curhorm'}>,
<Axes: title={'center': 'horm_f'}>,
<Axes: title={'center': 'horm_stat'}>],
[<Axes: title={'center': 'thorm'}>,
<Axes: title={'center': 'fchilda'}>,
<Axes: title={'center': 'livec'}>,
<Axes: title={'center': 'miscar'}>,
<Axes: title={'center': 'preg_f'}>],
[<Axes: title={'center': 'prega'}>,
<Axes: title={'center': 'pregc'}>,
<Axes: title={'center': 'stillb'}>,
<Axes: title={'center': 'trypreg'}>,
<Axes: title={'center': 'tubal'}>],
[<Axes: title={'center': 'fmenstr'}>,
<Axes: title={'center': 'lmenstr'}>,
<Axes: title={'center': 'menstrs'}>,
<Axes: title={'center': 'menstrs_stat_type'}>,
<Axes: title={'center': 'post_menopausal'}>],
[<Axes: title={'center': 'bbd'}>,
<Axes: title={'center': 'benign_ovcyst'}>,
<Axes: title={'center': 'endometriosis'}>,
<Axes: title={'center': 'uterine_fib'}>, <Axes: >]], dtype=object)
plt.figure(figsize=(20,20))
sns.heatmap(liver_cancer_df_with_bq[section23].corr(), cmap='RdBu_r', annot = True)
<Axes: >
Section 24: BQ Male Specific¶
section24_df = liver_cancer_df_with_bq[section24]
section24_df
| enlpros_f | enlprosa | infpros_f | infprosa | prosprob_f | urinate_f | urinatea | |
|---|---|---|---|---|---|---|---|
| 0 | 0.0 | NaN | 0.0 | NaN | 0.0 | 1.0 | NaN |
| 1 | 0.0 | NaN | 0.0 | NaN | 0.0 | 2.0 | 4.0 |
| 2 | 0.0 | NaN | 0.0 | NaN | 0.0 | 2.0 | 4.0 |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 154882 | 0.0 | NaN | 0.0 | NaN | 0.0 | 1.0 | NaN |
| 154883 | 1.0 | 4.0 | 0.0 | NaN | 1.0 | 2.0 | 4.0 |
| 154884 | 1.0 | 5.0 | 0.0 | NaN | 1.0 | 3.0 | 4.0 |
| 154885 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 154886 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
149369 rows × 7 columns
section24_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 149369 entries, 0 to 154886 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 enlpros_f 73434 non-null float64 1 enlprosa 15941 non-null float64 2 infpros_f 61368 non-null float64 3 infprosa 5161 non-null float64 4 prosprob_f 73405 non-null float64 5 urinate_f 73433 non-null float64 6 urinatea 25568 non-null float64 dtypes: float64(7) memory usage: 9.1 MB
daf.nulls_percentage(section24_df)
enlpros_f , 50.8% nulls , 2 unique values, float64 enlprosa , 89.3% nulls , 6 unique values, float64 infpros_f , 58.9% nulls , 2 unique values, float64 infprosa , 96.5% nulls , 6 unique values, float64 prosprob_f , 50.9% nulls , 2 unique values, float64 urinate_f , 50.8% nulls , 6 unique values, float64 urinatea , 82.9% nulls , 6 unique values, float64
Here we can see how many nulls or missing values we have in this section. One of the problems is, since is a section with characteristics only presented in men, every single woman has an NaN by default. I am going to proceed to assing to every female, 0 as default value.
daf.set_gender_characteristics_value(liver_cancer_df_with_bq, section24, 2)
section24_df_with_default_values_for_the_other_gender = liver_cancer_df_with_bq[section24]
daf.nulls_percentage(section24_df_with_default_values_for_the_other_gender)
enlpros_f , 0.1% nulls , 2 unique values, float64 enlprosa , 38.6% nulls , 7 unique values, float64 infpros_f , 8.2% nulls , 2 unique values, float64 infprosa , 45.8% nulls , 7 unique values, float64 prosprob_f , 0.1% nulls , 2 unique values, float64 urinate_f , 0.1% nulls , 6 unique values, float64 urinatea , 32.1% nulls , 7 unique values, float64
section24_df_with_default_values_for_the_other_gender.hist(figsize=(20,20), bins = 30, xrot=-45)
array([[<Axes: title={'center': 'enlpros_f'}>,
<Axes: title={'center': 'enlprosa'}>,
<Axes: title={'center': 'infpros_f'}>],
[<Axes: title={'center': 'infprosa'}>,
<Axes: title={'center': 'prosprob_f'}>,
<Axes: title={'center': 'urinate_f'}>],
[<Axes: title={'center': 'urinatea'}>, <Axes: >, <Axes: >]],
dtype=object)
plt.figure(figsize=(20,20))
sns.heatmap(liver_cancer_df_with_bq[section24].corr(), cmap='RdBu_r', annot = True)
<Axes: >
Section 25: BQ Prostate Surgery¶
section25_df = liver_cancer_df_with_bq[section25]
section25_df
| surg_age | surg_any | surg_biopsy | surg_prostatectomy | surg_resection | vasect_f | vasecta | |
|---|---|---|---|---|---|---|---|
| 0 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
| 1 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
| 2 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 154882 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
| 154883 | 3.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | NaN |
| 154884 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
| 154885 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 154886 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
149369 rows × 7 columns
section25_df.info()
<class 'pandas.core.frame.DataFrame'> Index: 149369 entries, 0 to 154886 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 surg_age 5693 non-null float64 1 surg_any 73525 non-null float64 2 surg_biopsy 71442 non-null float64 3 surg_prostatectomy 71210 non-null float64 4 surg_resection 71252 non-null float64 5 vasect_f 73310 non-null float64 6 vasecta 19895 non-null float64 dtypes: float64(7) memory usage: 9.1 MB
daf.nulls_percentage(section25_df)
surg_age , 96.2% nulls , 5 unique values, float64 surg_any , 50.8% nulls , 3 unique values, float64 surg_biopsy , 52.2% nulls , 2 unique values, float64 surg_prostatectomy , 52.3% nulls , 2 unique values, float64 surg_resection , 52.3% nulls , 2 unique values, float64 vasect_f , 50.9% nulls , 2 unique values, float64 vasecta , 86.7% nulls , 4 unique values, float64
Here we can see how many nulls or missing values we have in this section. One of the problems is, since is a section with characteristics only presented in men, every single woman has an NaN by default. I am going to proceed to assing to every female, 0 as default value.
daf.set_gender_characteristics_value(liver_cancer_df_with_bq, section25, 2)
section25_df_with_default_values_for_the_other_gender = liver_cancer_df_with_bq[section25]
daf.nulls_percentage(section25_df_with_default_values_for_the_other_gender)
surg_age , 45.4% nulls , 6 unique values, float64 surg_any , 0.0% nulls , 3 unique values, float64 surg_biopsy , 1.4% nulls , 2 unique values, float64 surg_prostatectomy , 1.6% nulls , 2 unique values, float64 surg_resection , 1.5% nulls , 2 unique values, float64 vasect_f , 0.2% nulls , 2 unique values, float64 vasecta , 35.9% nulls , 5 unique values, float64
section25_df_with_default_values_for_the_other_gender.hist(figsize=(20,20), bins = 30, xrot=-45)
array([[<Axes: title={'center': 'surg_age'}>,
<Axes: title={'center': 'surg_any'}>,
<Axes: title={'center': 'surg_biopsy'}>],
[<Axes: title={'center': 'surg_prostatectomy'}>,
<Axes: title={'center': 'surg_resection'}>,
<Axes: title={'center': 'vasect_f'}>],
[<Axes: title={'center': 'vasecta'}>, <Axes: >, <Axes: >]],
dtype=object)
plt.figure(figsize=(20,20))
sns.heatmap(liver_cancer_df_with_bq[section25].corr(), cmap='RdBu_r', annot = True)
<Axes: >
Variables selection¶
Feature selection is a critical step in building robust Machine Learning models. In this section, we refine the dataset by removing irrelevant, redundant, or highly correlated features that do not contribute significantly to liver cancer prediction.
The key aspects of this process include:
- Filtering out variables with low variance or excessive missing values.
- Removing highly correlated features to prevent multicollinearity.
- Keeping only relevant predictors based on domain knowledge and exploratory analysis.
By selecting the most meaningful variables, we enhance model interpretability and efficiency, ultimately improving predictive performance.
# Variables we are not going to consider for our model
variables_to_delete_set = {'build', 'build_cancers', 'build_death_cutoff', 'build_incidence_cutoff',
'liver_eligible_bq', 'entryage_bq', 'entrydays_bq', 'ph_liver_bq', 'ph_any_bq',
'liver_eligible_dhq', 'entryage_dhq', 'entrydays_dhq', 'ph_liver_dhq', 'ph_any_dhq',
'liver_eligible_dqx', 'entryage_dqx', 'entrydays_dqx', 'ph_liver_dqx', 'ph_any_dqx',
'liver_eligible_sqx', 'entryage_sqx', 'entrydays_sqx', 'ph_liver_sqx', 'ph_any_sqx',
'entryage_muq', 'entrydays_muq', 'ph_liver_muq', 'ph_any_muq',
'fstcan_exitstat', 'fstcan_exitdays', 'liver_exitdays', 'fstcan_exitage', 'mortality_exitdays',
'age',
'reconsent_outcome', 'reconsent_outcome_days',
'liver_cancer_diagdays', 'liver_annyr', 'liver_cancer_first',
'liver_behavior', 'liver_grade', 'liver_morphology', 'liver_topography', 'liver_seer', 'liver_seercat',
'is_dead', 'is_dead_with_cod', 'dth_days',
'd_seer_death', 'd_cancersite', 'd_dth_liver', 'd_codeath_cat',
'f_seer_death', 'f_cancersite', 'f_dth_liver', 'f_codeath_cat',
'bq_returned', 'bq_age', 'bq_compdays', 'bq_adminm',
'hispanic_f', 'educat', 'marital', 'occupat',
'cig_stop', 'cig_years', 'cigpd_f', 'pack_years', 'smoked_f', 'rsmoker_f',
'liver_fh_age',
'bmi_curr', 'weight_f', 'bmi_20', 'weight20_f', 'bmi_50', 'weight50_f',
'asp', 'ibup',
'hystera', 'ovariesr_f', 'bcontra', 'bcontrt', 'curhorm', 'horm_stat', 'thorm', 'fchilda', 'livec', 'prega', 'pregc', 'lmenstr', 'menstrs_stat_type', 'post_menopausal',
'enlprosa', 'infprosa', 'prosprob_f', 'urinatea',
'surg_age', 'surg_biopsy', 'surg_resection', 'vasecta',
'plco_id'}
len(variables_to_delete_set)
102
liver_cancer_df_with_bq_copy = liver_cancer_df_with_bq.copy()
liver_cancer_df_with_bq_final = liver_cancer_df_with_bq_copy.drop(columns=variables_to_delete_set)
liver_cancer_df_with_bq_final.shape
(149369, 65)
liver_cancer_df_with_bq_final_with_target = liver_cancer_df_with_bq_copy.drop(columns=variables_to_delete_set - {'liver_cancer'})
liver_cancer_df_with_bq_final_with_target.shape
(149369, 65)
daf.nulls_percentage(liver_cancer_df_with_bq_final)
liver_cancer , 0.0% nulls , 2 unique values, int64 liver_exitstat , 0.0% nulls , 8 unique values, int64 liver_exitage , 0.0% nulls , 37 unique values, int64 pipe , 1.1% nulls , 3 unique values, float64 cigar , 1.2% nulls , 3 unique values, float64 sisters , 1.0% nulls , 8 unique values, float64 brothers , 0.8% nulls , 8 unique values, float64 fmenstr , 0.1% nulls , 6 unique values, float64 menstrs , 0.9% nulls , 5 unique values, float64 miscar , 0.2% nulls , 3 unique values, float64 tubal , 0.2% nulls , 3 unique values, float64 tuballig , 0.2% nulls , 3 unique values, float64 bbd , 1.2% nulls , 2 unique values, float64 benign_ovcyst , 2.3% nulls , 2 unique values, float64 endometriosis , 2.5% nulls , 2 unique values, float64 uterine_fib , 1.8% nulls , 2 unique values, float64 trypreg , 0.2% nulls , 2 unique values, float64 stillb , 0.4% nulls , 3 unique values, float64 asppd , 0.4% nulls , 8 unique values, float64 ibuppd , 0.7% nulls , 8 unique values, float64 hyperten_f , 0.6% nulls , 2 unique values, float64 hearta_f , 0.7% nulls , 2 unique values, float64 stroke_f , 0.6% nulls , 2 unique values, float64 emphys_f , 0.6% nulls , 2 unique values, float64 bronchit_f , 0.7% nulls , 2 unique values, float64 diabetes_f , 0.6% nulls , 2 unique values, float64 polyps_f , 0.7% nulls , 2 unique values, float64 arthrit_f , 0.7% nulls , 2 unique values, float64 osteopor_f , 0.8% nulls , 2 unique values, float64 divertic_f , 0.8% nulls , 2 unique values, float64 gallblad_f , 0.7% nulls , 2 unique values, float64 race7 , 0.0% nulls , 7 unique values, int64 surg_prostatectomy , 1.6% nulls , 2 unique values, float64 surg_any , 0.0% nulls , 3 unique values, float64 preg_f , 0.0% nulls , 3 unique values, float64 hyster_f , 0.1% nulls , 3 unique values, float64 enlpros_f , 0.1% nulls , 2 unique values, float64 infpros_f , 8.2% nulls , 2 unique values, float64 urinate_f , 0.1% nulls , 6 unique values, float64 vasect_f , 0.2% nulls , 2 unique values, float64 bcontr_f , 0.1% nulls , 2 unique values, float64 horm_f , 0.1% nulls , 3 unique values, float64 smokea_f , 46.5% nulls , 63 unique values, float64 ssmokea_f , 57.7% nulls , 67 unique values, float64 filtered_f , 46.3% nulls , 3 unique values, float64 cig_stat , 0.0% nulls , 3 unique values, float64 bmi_curc , 1.5% nulls , 4 unique values, float64 height_f , 0.8% nulls , 37 unique values, float64 bmi_20c , 1.8% nulls , 4 unique values, float64 bmi_50c , 1.4% nulls , 4 unique values, float64 colon_comorbidity , 1.0% nulls , 2 unique values, float64 liver_comorbidity , 0.7% nulls , 2 unique values, float64 fh_cancer , 0.3% nulls , 2 unique values, float64 liver_fh , 0.8% nulls , 3 unique values, float64 liver_fh_cnt , 0.8% nulls , 4 unique values, float64 mortality_exitage , 0.0% nulls , 45 unique values, int64 mortality_exitstat , 0.0% nulls , 4 unique values, int64 ph_any_trial , 0.0% nulls , 2 unique values, int64 ph_liver_trial , 0.0% nulls , 1 unique values, int64 center , 0.0% nulls , 10 unique values, int64 rndyear , 0.0% nulls , 9 unique values, int64 arm , 0.0% nulls , 2 unique values, int64 sex , 0.0% nulls , 2 unique values, int64 agelevel , 0.0% nulls , 4 unique values, int64 in_TGWAS_population , 0.0% nulls , 2 unique values, int64
liver_cancer_df_with_bq_final.to_csv("../../0. Data/1. Cleaned/liver_cancer_df_with_bq_final.csv", index=False)
liver_cancer_df_with_bq_final_with_target.to_csv("../../0. Data/1. Cleaned/liver_cancer_df_with_bq_final_with_target.csv", index=False)